UDFs using Code in Cloud Storage

This notebook shows you how to use JavaScript UDFs (user-defined functions) in Google BigQuery that reference JavaScript code stored in Google Cloud Storage. Storing your UDF support code in Cloud Storage lets you reuse well-tested code and share it across multiple notebooks.

Before working through this tutorial, you should go through the UDFs in BigQuery tutorial, which discusses how to use UDFs in notebooks without external code, and the UDF Testing in the Notebook tutorial, which shows you how to run and test your JavaScript code in the notebook.

You can read more about UDFs in the BigQuery documentation.

Scenario

This notebook repeats a scenario presented in other notebooks: looking at anonymized logs that originated in Google App Engine.

With BigQuery, it is possible to store your UDF support code in Google Cloud Storage and reference it from your UDFs. Let's see how we can do this within Datalab.

Moving the Code to GCS

Recall from the previous notebook, UDF Testing in the Notebook, how we can define a JavaScript UDF and test it. After testing, we can use IPython's JavaScript kernel bridge (IPython.notebook.kernel.execute) to pass the UDF code over to the Python environment, where we can then use it in other Python cells:


In [1]:
%%javascript

/**
 * Our UDF function, which splits a set of URL query parameters into an array
 */
function getParameters(path) {
  var re = /[?&]([^=]*)=([^&]*)/g;
  var result = [];
  var match;
  while ((match = re.exec(path)) != null) {
    result.push(decodeURIComponent(match[2]));
  }
  return result;
}

// Now we want to test the UDF. We can try calling it using a sample line from our table.
// Note that the variable `element` is available to us to create output in the notebook,
// so our test emitter will use that variable to display the fields.
var test_path = '/log/page?project=14&instance=81&user=16&page=master&path=63&version=0.1.1&release=alpha';
var test_results = ['14', '81', '16', 'master', '63', '0.1.1', 'alpha'];

var params = getParameters(test_path);
var code = 'JsUDFCode = None';
if (JSON.stringify(params) == JSON.stringify(test_results)) {
  element.append('OK');
  code = 'JsUDFCode = """' + getParameters.toString() + '"""';
} else {
  element.append('Failed');
}

IPython.notebook.kernel.execute(code);


Let's make sure the code was copied over from JavaScript. If the next cell fails, it is because the execute(code) call above has not finished yet; wait a moment and re-run it.


In [4]:
print(JsUDFCode)


function getParameters(path) {
  var re = /[?&]([^=]*)=([^&]*)/g;
  var result = [];
  var match;
  while ((match = re.exec(path)) != null) {
    result.push(decodeURIComponent(match[2]));
  }
  return result;
}
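
If you would rather fail with an explicit message when the variable has not been populated yet, a minimal defensive check could look like the sketch below (this is an optional addition, not part of the original flow):

# Sketch: raise a clear error if the JavaScript cell has not populated
# JsUDFCode yet (for example, because execute(code) has not finished).
if not globals().get('JsUDFCode'):
    raise RuntimeError('JsUDFCode is not set yet; re-run the JavaScript cell and wait for it to finish.')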

Now that the testing is done, let's create a file in Cloud Storage to hold the bulk of the code. We can do that from the notebook. The bucket name is derived from your project, so you will need to execute this code in your own project.


In [6]:
from google.datalab import Context
import google.datalab.storage as gs

# Get a bucket in the current project
project = Context.default().project_id
sample_bucket_name = project + '-datalab-udf-samples'

# Create the storage bucket and code library object
sample_bucket = gs.Bucket(sample_bucket_name)
sample_bucket.create()
sample_object = sample_bucket.object('udf_library.js')
sample_object.write_stream(JsUDFCode, 'application/javascript')

# Print the URI of the library object to use in @import
print(sample_object.uri)


gs://mysampleproject-datalab-udf-samples/udf_library.js
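
As an optional sanity check, you can read the object back to confirm the upload succeeded; this sketch assumes the standard google.datalab.storage Object API (exists and read_stream):

# Optional sanity check (sketch): confirm the library object exists and
# inspect its contents.
print(sample_object.exists())
print(sample_object.read_stream())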

Once the code is copied to Cloud Storage, we can refer to it in the UDF's jsdoc-style comment header using @import. You can have more than one @import if necessary; a hypothetical two-import variant is sketched after the next cell. Note that in the cell below, you need to change the @import path to include your project instead of mysampleproject. You can use the URI output from the cell above as the argument to @import.


In [10]:
%%bq udf -n externalized_udf -l js
// @returns ARRAY<STRING>
// @param path STRING
// @import gs://mysampleproject-datalab-udf-samples/udf_library.js
return getParameters(path)
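
If your UDF needs more than one support file, you can stack @import lines. The sketch below is hypothetical: it assumes a second library object, another_library.js, uploaded to the same bucket in the same way.

%%bq udf -n multi_import_udf -l js
// @returns ARRAY<STRING>
// @param path STRING
// @import gs://mysampleproject-datalab-udf-samples/udf_library.js
// @import gs://mysampleproject-datalab-udf-samples/another_library.js
return getParameters(path)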

And now we can test externalized_udf:


In [13]:
%%bq query --name log_query --udfs externalized_udf
SELECT *, externalized_udf(path) as parameters FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`
LIMIT 5

In [15]:
%bq execute --query log_query


Out[15]:
timestamp | method | status | latency | path | parameters
2015-10-27 22:00:47.660171 | POST | 204 | 0.00337 | /log/signin?project=5&instance=2&user=54&page=master&path=3&version=0.1.1&release=alpha | ['5', '2', '54', 'master', '3', '0.1.1', 'alpha']
2015-10-27 01:03:10.959946 | POST | 204 | 0.003195 | /log/page?project=36&instance=40&user=131&page=detail&path=38&version=0.1.1&release=alpha | ['36', '40', '131', 'detail', '38', '0.1.1', 'alpha']
2015-10-27 01:24:18.065954 | POST | 204 | 0.003023 | /log/page?project=20&instance=29&user=42&page=detail&path=6&version=0.1.1&release=alpha | ['20', '29', '42', 'detail', '6', '0.1.1', 'alpha']
2015-10-27 00:57:44.694484 | POST | 204 | 0.003418 | /log/start?project=143&instance=215&user=2&page=master&path=3&version=0.1.1&release=alpha | ['143', '215', '2', 'master', '3', '0.1.1', 'alpha']
2015-10-27 20:10:19.547390 | POST | 204 | 0.00368 | /log/start?project=149&instance=232&user=2&page=master&path=3&version=0.1.1&release=alpha | ['149', '232', '2', 'master', '3', '0.1.1', 'alpha']

(rows: 5, time: 4.5s, 256KB processed, job: job_tNZqe-dwx_o6t0McL76Egeu9FHA)

If we inspect the query's expanded SQL, we can see that it is just a simple UDF that imports the library code from the Cloud Storage location where we stored it.


In [20]:
log_query


CREATE TEMPORARY FUNCTION externalized_udf (path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js
AS """
// @returns ARRAY<STRING>
// @param path STRING
// @import gs://mysampleproject-datalab-udf-samples/udf_library.js
return getParameters(path)
"""
OPTIONS (
library="gs://mysampleproject-datalab-udf-samples/udf_library.js"
);
SELECT *, externalized_udf(path) as parameters FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`
LIMIT 5

Cleaning Up

Since this is a tutorial, we should clean up the objects we created in Cloud Storage.


In [21]:
# Delete the library object first, then the (now empty) bucket.
sample_object.delete()
sample_bucket.delete()